Spectroscape enables real-time query and visualization of a spectral archive in proteomics

In proteomics, spectral archives organize the enormous amounts of publicly available peptide tandem mass spectra by similarity, offering opportunities for error correction and novel discoveries. Here we adapt an indexing algorithm developed by Facebook for organizing online multimedia resources to tandem mass spectra and achieve practically instantaneous retrieval and clustering of approximate nearest neighbors in a large spectral archive. An interactive web-based graphical user interface enables the user to view a query spectrum in its clustered neighborhood, which facilitates contextual validation of peptide identifications and exploration of the dark proteome.

The consensus library spectrum, shown as the node with a dashed outline, is near the center of a cluster. This represents the ideal scenario in which the replicate spectra are all highly similar to each other and the averaging scheme that produces the consensus spectrum works as intended. On the right, a butterfly plot shows the spectrum-spectrum match between the library spectrum (down) and one of the replicates (up, its node indicated by a red arrow). As shown, even for a neighbor at the edge of the cluster, the spectra are extremely similar. (b) Two clusters, colored pink and green, are connected with orange edges, which indicate a precursor mass difference. The spectra in pink and green are identified as peptides KGSLESPATDVFGSTEEGEKR/2 and GSLESPATDVFGSTEEGEKR/2, respectively; the leading K is the only difference and accounts for the precursor mass difference. The butterfly plot on the right shows that the two spectra share many y-ions but no b-ions because of the difference at the N-terminus.
Supplementary Figure 4 | The star connectivity with concentric circle plot. The user can use this visualization feature by selecting "Star" in the "Connectivity" menu in the graph edge parameters (purple box). The star visualization shows only the edges between the query node and its neighbors, as opposed to all edges between "neighbor" nodes. The concentric circle visualization places each neighboring node into an "orbit" divided according to a dot-product threshold. The star connectivity plot is better for showing the neighbors in order of similarity. The zoomed-in picture below shows that the query (identified as N[115]LVHIITHGEEKD/4, where N[115] is deamidated asparagine) has neighbors of slightly different sequences, including the library spectrum of unmodified NLVHIITHGEEKD/4 and, further away, the library spectra of the N-terminally truncated peptides LVHIITHGEEKD/3 and VHIITHGEEKD/3, which have a different charge state. The orange edges indicate a difference in precursor mass between nodes. The mass difference of deamidation (+1 Da), on the other hand, is within the user-defined tolerance for precursor mass difference (5 Da, shown above as the parameter "minimum Δmass to highlight"), and the corresponding edge is therefore shown in black.

Supplementary Figure 5 | The "butterfly" spectrum-to-spectrum match viewer. Two spectra can be shown "head-to-tail" for easier comparison, and each can be re-annotated with respect to the identification of the other node. The bottom spectrum is that of the query node, which is identified by the search engine as LGPALATGNVVVMK/2. The top spectrum is the query spectrum annotated with the identification of the selected node, which was identified as HVNPVQALSEFK/2. This feature is helpful for correcting the identification of a query node based on its neighbors.

Supplementary Figure 7 | The node table. This feature shows all the neighbors within a certain proximity. The "number of neighbors" parameter in the Search tab determines the number of nodes shown in the table. Double-clicking a column sorts the table by that column, allowing the user to quickly group nodes according to attributes such as Sequence, Probability, and File name. The search bar can also be used to filter the table (green box). Clicking on a cell in the "ID" column will initiate a search using that spectrum as a query.

algorithms to perform open modification searches against spectral libraries. Like Spectroscape, after creating a high-dimensional vector representation of each peptide in the library, the software package constructs an index structure that efficiently retrieves the closest library entries to a query spectrum. However, there are several key differences in their default settings. First, ANN-Solo searches for peptide candidates with a mass tolerance of 300 Da or 500 Da, while Spectroscape searches with no mass tolerance constraint. Second, Spectroscape calculates a simple dot product, while ANN-Solo uses a more sophisticated similarity measure that considers mass-shifted peak matches. Third, ANN-Solo supports the target-decoy search strategy to control FDR, while Spectroscape does not currently support FDR control.

Supplementary Tables
To compare the speed of the two software tools, we used them to search two files in the PXD000561 dataset against the NIST Human HCD library (build May 19, 2020). The default search parameters of ANN-Solo were altered to better mimic Spectroscape's similarity measure: the maximum accepted m/z was set to 6000 and the "peak shifts" option was set to false. We found that while the indexing time of Spectroscape is about six times longer than that of ANN-Solo, Spectroscape's search is about three times faster (Supplementary Table 1). Since the indexing of a spectral library only needs to be done once for multiple searches, the search time comparison is more relevant in practice.
Next, we analyzed the spectrum-spectrum matches (SSMs) filtered at 0.1% or 1% FDR by ANN-Solo and compared them to the SSMs found by Spectroscape for the same spectra. We found that the two sets of SSMs had good agreement, with an overlap of around 70%. Furthermore, we observed that this overlap could be increased by 5% by counting the lower-ranked hits of Spectroscape (up to rank 10), and by 2% if we consider sequences differing by one amino acid as matched (Supplementary Table 2).

Supplementary Note 2. Offline clustering and correction of unidentified and misidentified spectra
Although Spectroscape is primarily designed for on-demand visualization, its backend can potentially be used for offline spectrum clustering of the entire spectral archive. For any given query, Spectroscape returns a list of ANNs with their true dot products to the query. If we iterate over the whole archive, using each spectrum as a query to Spectroscape, we can construct an adjacency matrix from the returned lists of ANNs. Using the adjacency matrix, we connect any pair of spectra with a true dot product greater than 0.7. Spectra that are connected (directly or indirectly) are considered one cluster. For illustrative purposes, we only connect spectra with the same precursor charge and similar precursor m/z (within 1 Th), such that spectra in the same cluster can be presumed to come from the same peptide ion. We applied a simple and relatively conservative heuristic to re-annotate unidentified and misidentified spectra: if 60% or more of the spectra in a cluster are annotated to the same peptide ion (by MSFragger at 1% FDR), we consider them the "majority" with the correct identification of the cluster and re-annotate all other spectra in the cluster to the same peptide ion.
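The clustering and majority-rule steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the edge-list input format, and the use of a union-find structure to find connected components are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def cluster_spectra(edges, charges, mzs, min_dot=0.7, mz_tol=1.0):
    """Connected-component clustering with union-find.

    edges: iterable of (i, j, true_dot) pairs, e.g. collected from ANN lists.
    Two spectra are linked only if true_dot > min_dot, the precursor charges
    match, and the precursor m/z values differ by at most mz_tol (in Th).
    Returns one cluster label per spectrum.
    """
    parent = list(range(len(charges)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, dot in edges:
        same_ion = charges[i] == charges[j] and abs(mzs[i] - mzs[j]) <= mz_tol
        if dot > min_dot and same_ion:
            parent[find(j)] = find(i)
    return [find(x) for x in range(len(charges))]


def reannotate(labels, ids, majority=0.6):
    """Apply the majority rule; ids[k] is a peptide ion string or None."""
    members = defaultdict(list)
    for k, label in enumerate(labels):
        members[label].append(k)
    new_ids = list(ids)
    for group in members.values():
        counts = Counter(ids[k] for k in group if ids[k] is not None)
        if not counts:
            continue  # no identified member: nothing can be rescued
        top_id, n_top = counts.most_common(1)[0]
        if n_top / len(group) >= majority:
            for k in group:
                new_ids[k] = top_id
    return new_ids
```

In this sketch the 60% threshold is taken over all spectra in the cluster, identified or not, which is one reading of the rule stated above.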
Of the over 15 million unidentified spectra in the PXD000561 dataset, about 4 million belong to a cluster that includes at least one identified member. (The rest either are not clustered with any other spectrum or belong to a cluster with no identified member at all, and thus can never be re-annotated by clustering.) Of those that can potentially be "rescued," 589,162 (14.7%) can be re-annotated by the 60% majority rule. Among them, 308,949 (52.4%) were actually identified by MSFragger to the same peptide ion but fell short of the confidence threshold for 1% FDR, which partly validates this strategy of rescuing unidentified spectra. With the same heuristic, another 110,374 spectra whose identification conflicts with the majority are deemed misidentified and are re-annotated. In total, nearly 700,000 spectra are re-annotated. The nodes of the re-annotated spectra are re-colored and displayed in Spectroscape with an 'R' in the middle of the node.
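each of the subspaces are further sorted into 256 "sub-buckets" by k-means clustering, similar to the IVF step. Finally, the residual vector $\mathbf{r}$ is approximated by a direct sum of the centroids of the sub-buckets, with $\mathbf{c}^{(j)}$ denoting the sub-bucket centroid in subspace $j$:

$$\mathbf{r} \approx \hat{\mathbf{r}} = \mathbf{c}^{(1)} \oplus \mathbf{c}^{(2)} \oplus \mathbf{c}^{(3)} \oplus \cdots \oplus \mathbf{c}^{(16)}$$

The spectrum vector $\mathbf{v}$ is finally approximated by:

$$\mathbf{v} \approx \hat{\mathbf{v}} = \mathbf{c}_{\mathrm{bucket}} + \hat{\mathbf{r}}$$

where $\mathbf{c}_{\mathrm{bucket}}$ is the centroid of the bucket to which the spectrum is assigned. The "address" of the spectrum vector is simply a concatenation of the bucket number and the 16 sub-bucket numbers. Both the bucket number and the sub-bucket number have 256 possible values, so each can be specified in 1 byte. Hence, the address is 17 bytes long.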
At query time, the query spectrum undergoes the same spectrum preprocessing steps, and its vector representation $\mathbf{q}$ is computed. The address is found by the same algorithm, as follows:

$$\mathbf{q} \approx \hat{\mathbf{q}} = \mathbf{c}_{q} + \hat{\mathbf{r}}_{q}$$

where $\mathbf{c}_{q}$ is the centroid of the bucket to which $\mathbf{q}$ is assigned, $\mathbf{r}_{q}$ is the residual vector of $\mathbf{q}$, and $\hat{\mathbf{r}}_{q}$ is the approximation of $\mathbf{r}_{q}$ based on the direct sum of centroids. To retrieve approximate nearest neighbors, the algorithm first finds the $n$ nearest bucket centroids ($n$ can be set by the query-time parameter nprobes) and collects all the existing addresses (each corresponding to a spectrum in the spectral archive) in those buckets. These are the candidates for the approximate nearest neighbors. However, instead of returning all these candidates, the algorithm performs efficient approximate dot product calculations between the query's address and the candidates' addresses. For a candidate with spectrum vector $\mathbf{x}$:

$$\mathbf{q} \cdot \mathbf{x} \approx \hat{\mathbf{q}} \cdot \hat{\mathbf{x}} = (\mathbf{c}_{q} + \hat{\mathbf{r}}_{q}) \cdot (\mathbf{c}_{x} + \hat{\mathbf{r}}_{x})$$

Note that the dot product of two direct sums is distributive (provided the subspaces are the same) and can be written as a sum of dot products within each subspace, e.g.:

$$(\mathbf{a}^{(1)} \oplus \mathbf{a}^{(2)}) \cdot (\mathbf{b}^{(1)} \oplus \mathbf{b}^{(2)}) = \mathbf{a}^{(1)} \cdot \mathbf{b}^{(1)} + \mathbf{a}^{(2)} \cdot \mathbf{b}^{(2)}$$

Thus, the approximate dot product is calculated by summing inter-centroid dot products. Since the centroids are known at training, all inter-centroid dot products can be computed beforehand and stored in a lookup table. Therefore, no multiplication is necessary and only about 50 additions are required for each approximate dot product calculation. This is further accelerated by parallel computing on GPUs. The algorithm then returns the 1,024 most similar approximate nearest neighbors (ANNs) among the candidates, as ranked by this approximate dot product.
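The lookup-table idea can be illustrated with a small NumPy sketch. It covers only the product-quantized residual part of the address (the 16 sub-bucket numbers) and leaves out the bucket-centroid terms; the codebooks are random stand-ins for trained centroids, the subspaces are contiguous rather than randomly partitioned, and the subspace dimension is a toy value. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SUB, N_CODES, SUB_DIM = 16, 256, 8  # 16 subspaces, 256 sub-buckets; SUB_DIM is a toy size

# Stand-in for trained codebooks: codebooks[j, k] is centroid k of subspace j.
codebooks = rng.normal(size=(N_SUB, N_CODES, SUB_DIM))

# Precompute all inter-centroid dot products once, at training time:
# tables[j, a, b] = codebooks[j, a] . codebooks[j, b]
tables = np.einsum("jad,jbd->jab", codebooks, codebooks)


def decode(codes):
    """Rebuild the approximated vector as a direct sum of sub-bucket centroids."""
    return np.concatenate([codebooks[j, codes[j]] for j in range(N_SUB)])


def approx_dot(codes_q, codes_x):
    """Approximate dot product from two 16-byte codes: lookups and additions only."""
    j = np.arange(N_SUB)
    return tables[j, codes_q, codes_x].sum()
```

For any two codes, `approx_dot(codes_q, codes_x)` agrees with `np.dot(decode(codes_q), decode(codes_x))` up to floating-point rounding, which is the distributive property stated above applied subspace by subspace.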
To improve the recall, multiple distinct indices can be generated for the same spectral archive.
To produce a different index, a different random sample of spectra is used in the training, and a different random partitioning of bins is used during the decomposition into subspaces in the product quantization step. At query time, each index will return its own set of 1,024 ANNs, many, but not all, of which will be common between indices. The union of the ANNs returned by all indices is passed to the next step for true dot product calculations and clustering.
In summary, the IVF-PQ algorithm approximates each spectrum vector as a sum of centroids for efficient dot product calculations. The address is 17 bytes long, which means the addresses of 1 billion spectra can theoretically fit in 17 GB, small enough to be loaded into memory on most modern computers for fast processing. In terms of time complexity, retrieval of ANNs is expected to scale linearly with the archive size. This is because the number of spectra in each bucket, and hence the number of candidates for approximate dot product calculations, is expected to grow linearly with archive size. The efficient and parallelizable nature of the algorithm, however, helps to keep the running time manageable, especially with continuous hardware improvement (e.g., addition of more GPUs). Moreover, as more and more spectra are accumulated in spectral repositories, an increasing fraction of them (at least in their preprocessed and vectorized forms) will be identical or nearly identical, as they should ultimately originate from a finite set of observable peptide ions. Future development of the algorithm should aim to reduce this redundancy.
similarities. It has no query or visualization function. Spectroscape, on the other hand, is a visualization tool for a spectral archive. The only input to Spectroscape's indexing algorithm is the peak lists of all the spectra, and Spectroscape is agnostic about the identification or additional information such as the precursor mass. The IVF-PQ indexing of Spectroscape seeks to approximate the spectrum geometrically in high-dimensional space, so that spectral similarities can be computed quickly for ANN retrieval. Given a query spectrum, Spectroscape first uses the IVF-PQ index to retrieve its ANNs. Then it computes pairwise true spectral similarities among the ANNs and displays the neighborhood of any given query as a "network," or in computer science terms, a graph. This is done in real time. Spectroscape does not cluster all the input spectra beforehand. Unlike GLEAMS, Spectroscape does not output the mapping of each individual spectrum to the spectrum cluster it belongs to. Rather, Spectroscape allows the user to verify any PSM by observing its "neighborhood" in a spectral archive. To facilitate this application, the identification results of all spectra are loaded and used to color-code the nodes of the graph, after the fact. It is worth emphasizing that, unlike GLEAMS, Spectroscape does not take advantage of the peptide identification or precursor mass information in IVF-PQ indexing or ANN retrieval.
With this functional distinction between the two tools in mind, we cannot compare their outputs directly, but we can compare their ability to preserve the true dot product of pairs of spectra in their respective spectrum representations ("embedding" for GLEAMS, "address" for Spectroscape). For any pair of spectra, we can calculate the true dot product, the Spectroscape approximate dot product, and the GLEAMS approximate dot product (of the corresponding embedding vectors). We searched 20,000 randomly selected queries against the spectral archive and collected the query-neighbor spectrum pairs and their Spectroscape approximate dot products. The true dot product scores between these spectrum pairs are also calculated. Note that Spectroscape returns about 2,000 query-neighbor spectrum pairs for each query, which filters out pairs with very low similarity and of no practical relevance. The retained spectrum pairs thus form a more appropriate test set to evaluate the ability of the ANN retrieval algorithms.
The corresponding GLEAMS approximate dot product scores between these same spectrum pairs are also calculated. (To correctly find the corresponding embedding of a spectrum, we use the filename and scan number as a universal identifier for every spectrum.) We use scatter plots to visualize the correlation between the three scores, GLEAMS dot product vs true dot product,

This enables us to maintain higher m/z resolution compared to the 4096-dimension vector used for index building. Thus, the spectrum q is vectorized by binning the m/z axis into 65536 bins (with bin

This efficient dot product calculation avoids any comparison of m/z values and any vectorization of the spectra of the ANNs. The only steps required are to generate the vector representation of the query spectrum q and to perform the dot product calculation, which requires 50 multiplications and 49 additions.
To determine the whole cluster structure among the retained TNNs, pairwise accurate dot product calculations are necessary. To do so, each of the TNNs takes the place of the query spectrum q in the above algorithm.
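The accurate dot product step can be sketched as below, under several assumptions that are not stated in the text: each archived spectrum is stored as at most 50 peaks in the form of (bin index, unit-normalized intensity) pairs, which is inferred from the count of 50 multiplications and 49 additions; the bin width of 0.03 Th is a placeholder, since the actual value is not given; and no intensity transformation is applied. The function names are hypothetical.

```python
import numpy as np

N_BINS = 65536
BIN_WIDTH = 0.03  # hypothetical bin width in Th; the actual value is not given


def to_bins(peaks, top_n=50):
    """Keep the top_n most intense peaks; return (bin indices, unit-norm intensities)."""
    kept = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    merged = {}
    for mz, intensity in kept:
        b = min(int(mz / BIN_WIDTH), N_BINS - 1)  # clamp very high m/z to the last bin
        merged[b] = merged.get(b, 0.0) + intensity
    idx = np.fromiter(merged.keys(), dtype=np.int64)
    val = np.fromiter(merged.values(), dtype=np.float64)
    return idx, val / np.linalg.norm(val)


def dense_query(peaks):
    """Vectorize the query once into a dense 65536-bin array."""
    idx, val = to_bins(peaks)
    vec = np.zeros(N_BINS)
    vec[idx] = val
    return vec


def true_dot(query_vec, neighbor):
    """Dot product by direct indexing: no m/z comparison, no neighbor vectorization."""
    idx, val = neighbor
    return float(np.dot(query_vec[idx], val))
```

Only the query is expanded into a dense vector; each neighbor contributes one array lookup and one multiplication per stored peak, which is why the cost per pair stays small and independent of the number of bins.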
Supplementary Method 5 | Visualization and force-directed graph drawing
Spectroscape displays the queried spectrum as a node in a graph, together with the top N true nearest neighbors (where N is 20 by default but can be adjusted at query time) in its "neighborhood." The true nearest neighbors (TNNs) were obtained by calculating accurate dot products between the query and the retrieved ANNs. For the displayed nodes of TNNs, pairwise ... of the edge (the massless spring) is initialized to be a logistic transformation of the Euclidean distance $d$:

$$l = 30 + \frac{80}{1 + e^{20d - 12}}$$

Force-directed graph drawing can minimize crossing edges and overlapping nodes in graph drawing. More importantly, it has the effect of clearly displaying clusters of densely connected nodes, while sending loosely connected nodes farther away, facilitating the detection of subclusters, outliers and bridges.
The user can also choose to disconnect the edges between neighbors regardless of their similarities, so that the resulting graph is star-shaped with the query spectrum in the middle. In this mode, the cluster structure does not come into play. Concentric circles marking a scale of Euclidean distances from the center (query) node can be overlaid on the graph for easy visualization (Supplementary Figure 4).
Mousing over any node will display information about the node, including its identification (if any), precursor m/z, confidence of identification (shown as PeptideProphet and iProphet probabilities), source file and scan number, etc. Single-clicking any node will display its corresponding full spectrum in a spectrum viewer, with peak annotation if the identification is known.The user can also further explore the displayed "neighborhood" by double clicking on any node, which will initiate a new query with the clicked node as the query spectrum, replacing the old one.Alternatively (in an option available in a menu opened by right-clicking), one can expand the neighborhood by adding neighbors of any clicked node, without replacing the old graph.
The user can also enter a peak list into a text box in the web interface (in which each line is a peak represented by its m/z and intensity, separated by a space) and search it against the archive. This query spectrum is not added to the archive, but Spectroscape will retrieve its nearest neighbors among all spectra in the archive and display them as nodes in clusters (Supplementary Figure 5).
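As an illustration of the peak list format described above, a minimal parser might look like the following. The function name is hypothetical and this is not the web interface's own code; it simply reads one peak per line, with m/z and intensity separated by whitespace.

```python
def parse_peak_list(text):
    """Parse a pasted peak list into a list of (m/z, intensity) tuples."""
    peaks = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 2:
            raise ValueError(f"expected 'm/z intensity', got: {line!r}")
        peaks.append((float(fields[0]), float(fields[1])))
    return peaks
```

For example, `parse_peak_list("147.11 20\n175.12 100")` yields two peaks that could then be preprocessed and vectorized like any archived spectrum.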

Table 3 | Re-annotation of unidentified and misidentified spectra by spectrum clustering, in a spectral archive of ~26 million spectra in the PXD000561 dataset

Spectroscape can be used to perform a conventional library search, with the inherent ability to consider unexpected modifications, since it does not use the precursor m/z values to select candidates. To demonstrate this potential application, we compared the performance of Spectroscape and ANN-Solo, an open modification spectral library search engine. ANN-Solo is a software package that utilizes approximate nearest neighbor (ANN) search